by Yanhua He
Data set can be downloaded through this link Loan Data from Prosper.
This variable dictionary explains the variables in the data set.
In this project, I will use R and apply exploratory data analysis techniques to explore relationships in one variable to multiple variables and to explore prosper loan data set for distributions, outliers, and anomalies.
Referring to Wikipedia:
prosper has a transaction-based business model, in which the company collects revenue by taking a fee on its customers’ transactions. Borrowers who receive a loan pay an origination fee of 1.00% to 5.00% depending on the borrower’s Prosper Rating, and investors pay a 1% annual servicing fee.
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
The dataset contains information of loans from the inception of Prosper.com in 2005 till the 2014 March. The questions I want to explore are:
From the graph, we can see that there are much more borrowers in CA than in than other states. Considering that Prosper is San Francisco based company, it makes sense that it might be more popular in its homebase. Also, prosper had a large number(>5000) of customer base in FL,GA,IL,NY and TX. According to wikipedia, states (Iowa, Maine, and North Dakota) not permitted to borrow throughProsper, that is way, there is no data for those states.
We can see that borrowers are from all kinds of professions. Also, there are many people chose “other” or “professional” not specifyting the fields of their occupation.
The bar chart presents the amount of borrowers’ by income range groups.The verified income for most borrowers range from $25,000 to $74,999.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
It’s a fairly symetric distribution, with its midpoint at 0.22 and the majority of borrowers have debt to income ratio less than 0.5.However, we could see there is a tail with a few borrowers having >0.75 DebtToIncomeRatio.
Plot a histogram to show the distribution of the CreditScore of prosper borrowers.We can see from the plot, there are many whitespace between the CreditScore values, most borrowers have credit score around 700.Also, there are a few borrowers with score lower than 600 and I am quite curious about what rate they can get from Prosper.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 4.155 3.000 99.000 990
It is very positively skewed that the majority of borrowers don’t have any delinquencies. And there are some customers who have more than 10 delinquencies, but still could get loan from Prosper.
ProsperRating is applicable for loans originated after July 2009 in forms of alpha and numeric, where AA(or 10) is the lowest risk down to HR(or 1) which stands for high risk.Take a look at how the amount of customers distribute in different ProsperRatings. A middle rating C is mostly common for customers, and only a small fraction of customers are rated AA.
ProsperScore is a custom risk score and is applicable for loans originated after July 2009. Most customers have scores between 4 and 8.
We can see the amount of loans dropped significantly from 2008 Q4, it seems that Prosper.com had lost a lot of its prosperity.According to Wikipedia, in 2008 Q4, a class action lawsuit was filed against Prosper alleging its violation of the California and federal secrities laws.After its reopen in 2009, prosper loans have continued to grow and regain its popularity.
From the plot we can conclude that before 2010, there are only 36-month loans, starting from late 2010, there are 3 terms available, the most popular one is 36-month loan, and then it’s 60-month loan, 12-month term is rarely seen. Currently, it seems that 12-month term is no longer available.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
The distribution of loan amount in Prosper is very positively skewed.The median amount is $6500.Most people apply for loans between $1000 and $10000.Several spikes show that $4000,$10000,$15000 are very popular choices.Few people borrow more than $25000.
The boxplot shows the loan amount applied by customers with different ratings. Most of the loans applied by HR, E rated customers are less than $5000. It seems that you need to have a rating equal or higher than B to be eligible to apply loans>$25000. This to some extent explains why larger loans get better rates because they are applied by people with better prosper ratings.Referring to a post:“Prosper’s approach is to have a sliding scale of the maximum borrower amount allowed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 131.6 217.7 272.5 371.6 2252.0
The above histogram shows the distribution of monthly payment, it is left skewed with the bulk of payment per month is under $400, the median monthly payment is around $217 and the most common payment per month is $150 dollors.
It shows more than 80000 borrowers don’t have prior loans, which means most of borrowers are first timers. Also, there are more that 15,000 listings of which the borrowers had one loan before from prosper.
From Prosper’s webpage, there is an explanation about the name change for “Defaults”. In mid-August 2008, Prosper changed the way they displayed seriously delinquent loans on the marketplace performance page, renaming “Defaults” as “Charge-offs”, and moving the “4+ months late” loans into the “Charge-offs” category. Their goal was further transparency in reporting their marketplace’s default rate. I will group defaulted and chargedoff together as “Chargedoff”, and group all of the past due together as “Past Due”, and group “FinalPaymentInProgress” together with “Completed”. Reference
##
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 806 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 363 313 304
##
## Cancelled Chargedoff Completed
## 0.0000438839 0.1052511476 0.3341671274
## Current Defaulted FinalPaymentInProgress
## 0.4965551138 0.0440418828 0.0017992399
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 0.0001404285 0.0070740848 0.0023258467
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 0.0031859712 0.0027471322 0.0026681412
33.4% of the loans are completed, 49.7% of the loans are still current, and there is so far 4.4% defaulted rate and 10.5% chargedoff rate.
The Chargedoff percentage for loans orginated between 2006 and 2008 was very high, most times was even higher than 30%. Then the percentage dropped dramatically in 2009, continued to fluctuate around 0.15~0.2, and starting from 2011 Q3, the percentage was continuously dropping, which means that Prosper has improved their risk management hugely. One reminder is that some of the loans originated from 2011 are still current when the data was pulled, and most of loans originated during 2013&2014 are still current.
From the plot, we can see before 2009, the CreditGrade model performed badly to differentiate the risk levels, considering the relatively small volume of loans sold during that period, there were even more chargedoffs for credit grade from HR to C. Even for the best rating & lowest risk “AA”, there were more than 500 charged offs.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00653 0.15630 0.20980 0.21880 0.28380 0.51230 25
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1928 0.2500 0.4975
BorrowerAPR is BorrowerRate plus fees. Looking at the Borrower interest rate, the distribution is quite symetric with median equals to 0.184 & mean equals to 0.1928, except there is a spike at 0.32.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.183 0.116 0.162 0.169 0.224 0.320 29084
It looks like a normal distribution except for the spikes at around 0.25 & 0.3. The median effective yield rate is 0.162, and maximum yield rate can be as high as 0.32.
Starting from July 2009, borrower rates are set by Prosper’s model, Prosper determines its rates using a traditional credit score and its own proprietary Prosper score based on their own historical data.
The main features of interest are how prosper determines the BorrowerRate and what would be the return to invest in Prosper as a lender as well as how safe it is.
The customers’ profile including credit score, income, debt status, bankcard information and whehter or not in good standing with Prosper will help support my investigation into how to set the borrowerRate for them.
Also, the historical data including loan status, principal loss will help to complute the risk of investing in Prosper.
I sorted the factor variable into order, such as CreditScore & ProsperRating, sort them from highest risk to lowest risk.Sort the IncomeRange Variable from lowest to highest; Change the form of several variable into date format. Also, changed the LoanOriginationQuarter’s format from “Q%q %Y” to “%Y Q%q”. And I changed the Team variable type from num to factor. For variable TotalProsperLoans, I replaced the NA values with 0, for calculation ease.
## Ord.factor w/ 7 levels "HR"<"E"<"D"<"C"<..: 6 6 3 5 2 4 7 7 4 5 ...
ProsperRating is a proprietary rating mechanism to determine the credit grade for borrowers and the rates as shown in the graph above. We can see from the plots, every line covers a wide range of credit score(for borrowers with same ProsperRating, there are huge variance in their credit score). For rating A, the credit score must be higher than 650, for AA there is a even higher minimal credit score requirement. There are sudden rate jump-ups for E & D rating with CreditScore higher than 850, which looks very abnormal, and there must be some hidden reason for this.
As you can see in the above graphic showing the borrowers’ interest rates based on their Prosper rating grades, the median interest Rate for an AA loan is below 10%, however for a HR loan is even higher than 30%. ProsperRating..Alpha is adopted by Prosper to determine the interest rate for borrowers.
Most of the borrowers have less than 3 inquiries during last 6 months. And there seems to be a slope to show that as you have more inquiries, at a higher possibility, you will be with a higher borrower rate.Also, there is a spike at >0.3 borrower rate, with more points have >5 inquiries recently.
ProsperScore is a custom risk score, unlike a credit bureau score, it is specifically built on the Prosper borrower population. The boxplot shows that customers assessed with lowest risk (score 11) enjoy a much lower borrower rate than those with highest risk(score 1).Median borrower rate will be lower if the score gets higher, except that between score 4 and 5, there is a fluctuation.
It is interesting to find out that before 2009, there were many borrowers with CreditScore below 600 could get loans and even some of them got very low interest rate. After the relaunch of Prosper in 2009, there is an obvious cut off of CreditScore requirement.The cut off line is close to 600, which customers must have minimal credit score of 600 to borrow through Prosper.
The line looks quite linear. As the CreditScore increases, the lower rate of a loan will get on average, and if the credit score reaches 850, the rate_mean remains quite flat.
## [1] -0.9838036
The correlation coefficient is -0.98, which means the CreditScore and the mean BorrowerRate are strongly correlated.
## [1] -0.5090406
For data after July 2009, investigate the correlation between CredietScore and BorrowerRate. The correlation coefficient is -0.51, they are some correlation, but also there are many variation for a BorrowerRate based on the same CreditScore.
## loan_reopen$IsBorrowerHomeowner: False
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0423 0.1474 0.2015 0.2059 0.2639 0.3600
## --------------------------------------------------------
## loan_reopen$IsBorrowerHomeowner: True
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.1269 0.1765 0.1871 0.2489 0.3500
By looking at the graph, it’s hard to tell whether there is any difference in the rate between homeowner & non-homeowner. I run a summary statistic for these tow groups, and find that the median borrower rate for homeowners is 2%+ lower than the rate for Non-homeowners. And from the histogram of BorrowerRate by Homeowners, the count of homeowners is almost twice of the count of non-homeowners.
## loan_reopen$Term: 12
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.0929 0.1434 0.1501 0.2064 0.2669
## --------------------------------------------------------
## loan_reopen$Term: 36
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0423 0.1274 0.1899 0.1985 0.2699 0.3600
## --------------------------------------------------------
## loan_reopen$Term: 60
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0669 0.1490 0.1870 0.1930 0.2319 0.3304
The table shows that for 12-month loan, the rate is much lower. And there isn’t big difference in terms of rate between 36-month and 60-month loans.
There is no obvious pattern between delinquencies and BorrowerRate. I will create a factor variable based the # of delinquencies and explore further in the multivariable section.
## NULL
## loan$Phase: After 2009
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.1359 0.1875 0.1960 0.2574 0.3600
## --------------------------------------------------------
## loan$Phase: Before 2009
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1270 0.1700 0.1834 0.2365 0.4975
For loans before 2009, there is a huge amount of changeoffs at a borrower rate around or higher than 0.3. The plot shows more than half of the loans became chargedoff. For loans after 2009, the Chargedoff risk seems to jump up when the borrower rate is higher than 0.25. The worse the credit rating one has, the higher borrower rate he or she gets, and more likely of being delinquent.From the statistics, we can see the median borrower rate after 2009 is 18.75% which is higher than that of before 2009.Generally speaking, borrowers before the 2009 tend to get a lower rate from Prosper.
As the plot shows, the loans made between 2006 and 2008 had a very high percentage of chargedoff.After the relaunch in late 2009, the chargedoff rate became much lower. It seems Prosper has a much better model to evaluate qualifified loans and lower the risk for lenders.
## [1] 0.1264026
The correlation coefficient is 0.13. There is a weak correlation between the BorrowerRate vs. DebtToIncomeRatio.
From the upper plot, we see the median borrower rate gradually decreases as the total loans increase. The boxplot shows similar information. Although for borrowers with 6 and above transactions, the borrower rate is much lower, considering that there are very few customers borrowed that many times(data size is so small), it is not sufficient to conclude you will get a significant drop in rate if you borrow moren than 5 times.
The borrower rate is correlated with the CreditScore. To some extent, the higher the credit score is, the lower the interest rate the borrower gets. Also, the worse the credit rating one has, the higher borrower rate he or she gets. The borrowers applying for smaller amount (<10000) of loans tend to have a higher interest rate. Most of the borrowers with loan amount > 30000, regardless of their CreditScore, they obtain a rate lower than 0.2. This means larger loans tend to get better rates.
Chargedoff risk: For loans before 2009, there is a huge amount of changeoffs at a borrower rate around or higher than 0.3, more than half of the loans became chargedoff. For loans after 2009, the Chargedoff risk seems to jump up when the borrower rate is higher than 0.25. Comparing the chargedoff percentage by each quarter, the chargedoff percentage has been continuously decreasing since its relaunch in 2009. Prosper has a much better model to evaluate qualifified loans and lower the risk for lenders since its reopen.
The strongest relationship I found is the mean of BorrowerRate at certain CreditScore vs. the CreditScore. Also, the BorrowerRate has a strong relationship with the ProsperRating, the better the rating, the lower the rate.
The higher your prosper score is, the higher possibility that you can get a better score. Also,there are many people with CreditScore lower than 650 who have prior loans from prosper. This means Prosper is an attractive platform for them to get loans.
The data points are very condensed to the leftmost, which means the majority of borrowers have no delinquency history. For lower ratings like HR and E, there are more data points falling to the right side of the plot.
I created a new variable named “Delinquency_history”, if the customer had 0 or NA delinquencies in last 7 years, the value would be “None”, if the number is below 4, the value is “Below Average”, if the number is equal to or above 4, the value is “Above Average”. From the plot, we can see that dots with above average delinquency are much more condense on the left side of the graph, which means they have lower CreditScore. There is no obvious pattern between BorrowerRate and Delinquency history. ### BorrowerRate by Prosper Rating and LoanAmount
## [1] -0.4122378
## [1] 0.173795
However, based on the correlation coefficient(-0.41), there are some negative correlation between LoanOriginalAmount and BorrowerRate that a larger amount tends to get a lower rate. I further check the correlation coefficient of those two variables for customers with A rating, the value is 0.17. So I think the negative coefficient for overall correlation between Prosper Rating & Loan Amount is mainly driven by the good prosper ratings of borrowers with large loans.
Green is very dominant on the grid except for A and AA, which means most of Prosper customers are heavy users of bankcards(>=50% of credit utilization at the time when pulled their profile).As ProsperRating upgrades from HR to AA, the quantity of mild use dots increases, also appartmently the red is more dense on the right side of the green, which means generally speaking mild users of bankcard tend to have higher credit scores.However, there is no obvious pattern between BorrowerRate and bankcard utiliztion.
## [1] "ListingKey"
## [2] "ListingNumber"
## [3] "ListingCreationDate"
## [4] "CreditGrade"
## [5] "Term"
## [6] "LoanStatus"
## [7] "ClosedDate"
## [8] "BorrowerAPR"
## [9] "BorrowerRate"
## [10] "LenderYield"
## [11] "EstimatedEffectiveYield"
## [12] "EstimatedLoss"
## [13] "EstimatedReturn"
## [14] "ProsperRating..numeric."
## [15] "ProsperRating..Alpha."
## [16] "ProsperScore"
## [17] "ListingCategory..numeric."
## [18] "BorrowerState"
## [19] "Occupation"
## [20] "EmploymentStatus"
## [21] "EmploymentStatusDuration"
## [22] "IsBorrowerHomeowner"
## [23] "CurrentlyInGroup"
## [24] "GroupKey"
## [25] "DateCreditPulled"
## [26] "CreditScoreRangeLower"
## [27] "CreditScoreRangeUpper"
## [28] "FirstRecordedCreditLine"
## [29] "CurrentCreditLines"
## [30] "OpenCreditLines"
## [31] "TotalCreditLinespast7years"
## [32] "OpenRevolvingAccounts"
## [33] "OpenRevolvingMonthlyPayment"
## [34] "InquiriesLast6Months"
## [35] "TotalInquiries"
## [36] "CurrentDelinquencies"
## [37] "AmountDelinquent"
## [38] "DelinquenciesLast7Years"
## [39] "PublicRecordsLast10Years"
## [40] "PublicRecordsLast12Months"
## [41] "RevolvingCreditBalance"
## [42] "BankcardUtilization"
## [43] "AvailableBankcardCredit"
## [44] "TotalTrades"
## [45] "TradesNeverDelinquent..percentage."
## [46] "TradesOpenedLast6Months"
## [47] "DebtToIncomeRatio"
## [48] "IncomeRange"
## [49] "IncomeVerifiable"
## [50] "StatedMonthlyIncome"
## [51] "LoanKey"
## [52] "TotalProsperLoans"
## [53] "TotalProsperPaymentsBilled"
## [54] "OnTimeProsperPayments"
## [55] "ProsperPaymentsLessThanOneMonthLate"
## [56] "ProsperPaymentsOneMonthPlusLate"
## [57] "ProsperPrincipalBorrowed"
## [58] "ProsperPrincipalOutstanding"
## [59] "ScorexChangeAtTimeOfListing"
## [60] "LoanCurrentDaysDelinquent"
## [61] "LoanFirstDefaultedCycleNumber"
## [62] "LoanMonthsSinceOrigination"
## [63] "LoanNumber"
## [64] "LoanOriginalAmount"
## [65] "LoanOriginationDate"
## [66] "LoanOriginationQuarter"
## [67] "MemberKey"
## [68] "MonthlyLoanPayment"
## [69] "LP_CustomerPayments"
## [70] "LP_CustomerPrincipalPayments"
## [71] "LP_InterestandFees"
## [72] "LP_ServiceFees"
## [73] "LP_CollectionFees"
## [74] "LP_GrossPrincipalLoss"
## [75] "LP_NetPrincipalLoss"
## [76] "LP_NonPrincipalRecoverypayments"
## [77] "PercentFunded"
## [78] "Recommendations"
## [79] "InvestmentFromFriendsCount"
## [80] "InvestmentFromFriendsAmount"
## [81] "Investors"
## [82] "CreditScore"
## [83] "Phase"
## [84] "BankCardUse"
## [85] "ReturnCustomer"
## [86] "ProsperScorelevel"
## [87] "Delinquency_history"
As shown on the scatterplot matrices, the borrower rate is correlated with credit score, the correlation coefficient is -0.516. BorrowerRate has some correlation with LoanAmount, the correlation coeffecient is -0.38.
##
## Calls:
## m1: lm(formula = I(BorrowerRate) ~ I(CreditScore), data = loan_reopen)
## m2: lm(formula = I(BorrowerRate) ~ I(CreditScore) + LoanOriginalAmount,
## data = loan_reopen)
## m3: lm(formula = I(BorrowerRate) ~ I(CreditScore) + LoanOriginalAmount +
## InquiriesLast6Months, data = loan_reopen)
##
## =========================================================
## m1 m2 m3
## ---------------------------------------------------------
## (Intercept) 0.768*** 0.708*** 0.683***
## (0.003) (0.003) (0.003)
## I(CreditScore) -0.001*** -0.001*** -0.001***
## (0.000) (0.000) (0.000)
## LoanOriginalAmount -0.000*** -0.000***
## (0.000) (0.000)
## InquiriesLast6Months 0.011***
## (0.000)
## ---------------------------------------------------------
## R-squared 0.3 0.3 0.4
## adj. R-squared 0.3 0.3 0.4
## sigma 0.1 0.1 0.1
## F 29722.5 21761.9 17187.3
## p 0.0 0.0 0.0
## Log-likelihood 112679.1 117507.0 120085.3
## Deviance 350.9 313.2 294.8
## AIC -225352.2 -235006.0 -240160.6
## BIC -225324.1 -234968.6 -240113.8
## N 84984 84984 84984
## =========================================================
I use CreditScore, LoanOriginalAmount, InquiriesLast6Months to build a linear model to predict the BorrowerRate.The R-Squared is only 0.4, so there is still a lot more unexplained variations.
The borrower rate is highly correlation with EstimatedLoss rate, in variable description, the EstimatedLoss is explained as “Estimated loss is the estimated principal loss on charge-offs.”, which means it represents the risk of being chargeddoff, it should be estimated through a similar model to the one used to estimate the borrow rate.Both CreditScore and Loan amount have correlations with the BorrowerRate. Loan amount has an interesting impact on the rate you can get, it seems Prosper incentivizes customers to get larger loans by providing better rates.
Both bankcardUse and delinquency history have interactions with credit score. People with more delinquencies and more utilizations of bankcard tend to have a lower credit score.
The two plots show that the loans made between 2006 and 2008 had a very high percentage of chargedoffs, which led to its shut-down in late 2008.
After the relaunch in 2009, Prosper has moved away from the auction process to a new fixed rate model, in this way, Prosper could evaluate the borrowers’ qualification and decide his/her interest rate.
This renovated model proved to be very successful, as we saw the # of loans were continuously growing while the chargedoff rate was dropping since then.Prosper’s loan sales peaked in 2013 Q4, and most of loans are still current. In my opinion, it is very safe for investors, since Prosper has a solid loan approval model to lower the risk for lenders.
## loan_reopen$ProsperRating..Alpha.: HR
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1827 0.1135 0.1221 0.1136 0.1246 0.1399
## --------------------------------------------------------
## loan_reopen$ProsperRating..Alpha.: E
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0124 0.1054 0.1239 0.1247 0.1487 0.1843
## --------------------------------------------------------
## loan_reopen$ProsperRating..Alpha.: D
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0045 0.1012 0.1163 0.1187 0.1414 0.2332
## --------------------------------------------------------
## loan_reopen$ProsperRating..Alpha.: C
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.00910 0.08227 0.09220 0.09810 0.11050 0.26670
## --------------------------------------------------------
## loan_reopen$ProsperRating..Alpha.: B
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.00100 0.07408 0.08215 0.08629 0.09260 0.28370
## --------------------------------------------------------
## loan_reopen$ProsperRating..Alpha.: A
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01780 0.06081 0.06663 0.06965 0.07284 0.18310
## --------------------------------------------------------
## loan_reopen$ProsperRating..Alpha.: AA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01460 0.04554 0.05100 0.05399 0.05540 0.19360
Estimated return for a loan is calculated at the time when it is listed, which considers the risk of loss, such as being late or defaulted.
Background info The loan industry is new to me, thus I spent a lot of time in understanding what is Prosper’s business model and what each variable stands for. Gaining adequate knowledge about the industry helps a lot in further analysis, in this case I could generate more insights instead of just merely presenting the data outcome.
Data munging and transformation: It is important to do neccesary data munging for the ease of further analysis, such as to convert date numerics into date format, sort the levels of categorial variables into desired order, create levels to simply representing a continously scale. Successful data wrangling will make your analysis more efficient and effective.
Insight generation from EDA: During initial exploration, I found the big difference in data before 2009 and after the relaunch. The reason behind is Prosper changed their business model to provide fixed rate other than auction, also Prosper improved the model to qualify the borrowers.So I made a decision to create a new subset which only containing the data after reopen, in this way I could avoid the influences from the old Prosper model, and have better analysis results.
Exploratory results: by conducting the exploratory data analysis, I am able to depict proper borrowers’ profiles. Also, I successfully discovered what factors could determine your rates, how safe it is to invest in prosper. Based on my findings, if you are a lender, it is very safe to invest in Prosper now, as the Chargedoff rate dropped significantly. As a borrower, to some extent, you will possibly get a better rate with a higher credit score, but there are still some other factors would impact your rate such as your recent credit inquiries, loan amout etc.
I was really struggling to come out a liner model with a good fit to predict the borrower rates. But I didn’t find other variables to be incorporated in the model to improve its performance. Also, when conducting bivariable analysis, there are many noises to influence the two variables you want to explore, so it is very critical to choose a good visualization (type of plot, color, facet_wrap etc) or find a way to compute statistics (mean, median) to describe & compare the two groups.
There are some other variables I haven’t explored yet, like the variables related to credit lines & revolving accounts. They may have important information as to predict the rates. On Prosper.com, it is saying that they use historical preformance data to make predictions, I think they deploy some machine learning techniques to make this happen. I am wondering what kind of algorithm they use. Also, I am very curious about the rejected loan applications. This data set only has listed & approved loans, how about the application turned down by prosper, what characteristics of them don’t meet prosper’s minimal requirement.